Stripping Adjectives: Integration Techniques for Selective Stemming in SMT Systems
Authors
Abstract
In this paper we present an approach to reduce data sparsity problems when translating from morphologically rich languages into less inflected languages by selectively stemming certain word types. We develop and compare three different integration strategies: replacing words with their stemmed form; combined input, which uses alternative lattice paths for the stemmed and surface forms; and a novel hidden combination strategy, in which we replace the stems in the stemmed phrase table with the surface forms observed in the test data. The hidden combination allows us to apply advanced models trained on the surface forms of the words. We evaluate our approach by stemming German adjectives in two German→English translation scenarios: a low-resource condition and a large-scale state-of-the-art translation system. We improve over our baseline by between 0.2 and 0.4 BLEU points and reduce the number of out-of-vocabulary words by up to 16.5%.
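The hidden combination strategy described above can be illustrated with a minimal sketch. This is not the authors' implementation: the phrase table layout (a dict from stemmed source phrases to scored translations), the suffix list, the `adjectives` set standing in for a part-of-speech tagger, and the function names are all assumptions made for illustration.

```python
# Toy illustration of "hidden combination": entries of a phrase table
# trained on stemmed text are re-keyed by the surface forms that
# actually occur in the test data. All names here are hypothetical.

ADJ_SUFFIXES = ("em", "en", "er", "es", "e")  # assumed German adjective endings


def stem(word, adjectives):
    """Strip one inflectional ending, but only from known adjectives."""
    if word not in adjectives:
        return word
    for suffix in ADJ_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def hidden_combination(stemmed_table, test_sentences, adjectives, max_len=3):
    """Build a surface-form phrase table from a stemmed one.

    stemmed_table maps a tuple of (stemmed) source tokens to a list of
    (target phrase, score) pairs. For every source span in the test
    data, the span is stemmed, looked up, and the matching entries are
    stored under the original surface span.
    """
    surface_table = {}
    for sentence in test_sentences:
        tokens = sentence.split()
        for start in range(len(tokens)):
            stop = min(start + max_len, len(tokens))
            for end in range(start + 1, stop + 1):
                surface = tuple(tokens[start:end])
                stemmed = tuple(stem(tok, adjectives) for tok in surface)
                if stemmed in stemmed_table:
                    surface_table[surface] = stemmed_table[stemmed]
    return surface_table


if __name__ == "__main__":
    adjectives = {"schönes", "schönen"}
    stemmed_table = {
        ("schön",): [("beautiful", 0.9)],
        ("schön", "Haus"): [("beautiful house", 0.7)],
    }
    table = hidden_combination(stemmed_table, ["ein schönes Haus"], adjectives)
    for source, targets in table.items():
        print(" ".join(source), "->", targets)
```

Because the resulting table is keyed by surface forms, the decoder input stays unstemmed, which is what lets models trained on surface forms still be applied.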
Similar resources
Thai Sentence-Breaking for Large-Scale SMT
Thai language text presents challenges for integration into large-scale multilanguage statistical machine translation (SMT) systems, largely stemming from the nominal lack of punctuation and inter-word space. For Thai sentence breaking, we describe a monolingual maximum entropy classifier with features that may be applicable to other languages such as Arabic, Khmer and Lao. We apply this senten...
Evaluation of a Dutch stemming algorithm
In state of the art Information Retrieval (IR) systems the most salient problem is to improve recall rates while retaining high precision. A simple recall enhancing technique which can be useful for even the simplest boolean retrieval systems is stemming. It is obvious that an information-seeker who is looking for texts about, for example, dogs is probably interested in a text which contains th...
Revisiting Enumerative Instantiation
Formal methods applications often rely on SMT solvers to automatically discharge proof obligations. SMT solvers handle quantified formulas using incomplete heuristic techniques like E-matching, and often resort to model-based quantifier instantiation (MBQI) when these techniques fail. This paper revisits enumerative instantiation, a technique that considers instantiations based on exhaustive en...
Stemming Hausa text: using affix-stripping rules and reference look-up
Stemming is the process of reducing a derived or inflected word to its root or stem by stripping all its affixes. It has been used as a preprocessing step in applications such as information retrieval, machine translation, and text summarization to increase efficiency. Currently, there are a few stemming algorithms which have been developed for languages such as English, Arabic, Turki...
New techniques for instantiation and proof production in SMT solving. (Nouvelles techniques pour l'instanciation et la production des preuves dans SMT)
In many formal methods applications it is common to rely on SMT solvers to automatically discharge conditions that need to be checked and provide certificates of their results. In this thesis we aim both to improve their efficiency and to increase their reliability. Our first contribution is a uniform framework for reasoning with quantified formulas in SMT solvers, in which generally various...